feat(orchestrator): per-env sample strategy + env-mix seam #2722
Draft
hallerite wants to merge 1 commit into
Introduce the per-env sampling seam. Each train env owns a `SampleStrategy` (what example to serve, plus an `observe()` feedback hook); env selection is delegated to a swappable `EnvMixStrategy`. Defaults reproduce today's behavior (weighted round-robin over per-env reshuffling-cursor datasets).

- `orchestrator/sampling.py` (new): SampleStrategy + ShuffledCursorSampler; EnvMixStrategy + WeightedRoundRobin.
- TrainEnv owns its dataset via `build_sampler()` and holds `.sampler`.
- TrainSource slims to env-mix + per-env samplers.
- TrainSink.process_group calls `env.sampler.observe(survivors)` after advantages (no-op default): the feedback wire for curriculum / replay samplers.

Behavior-equivalent; RNG partitioned per-env + mix. Stacked on feat/per-env-advantage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
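As a rough illustration of the per-env half of the seam named in the commit message, the sketch below shows one way a `SampleStrategy` base class and a `ShuffledCursorSampler` default could look. Only the class names and `observe()` come from this PR; the `next_example` method name on the sampler, the constructor arguments, and the use of Python's `random.Random` are assumptions made for the example, not the code in `orchestrator/sampling.py`.

```python
import random
from abc import ABC, abstractmethod
from typing import Any, Sequence


class SampleStrategy(ABC):
    """Per-env policy: which example to serve next, plus a feedback hook."""

    @abstractmethod
    def next_example(self) -> Any:
        """Return the next example for this env (method name is assumed)."""

    def observe(self, rollouts: Sequence[Any]) -> None:
        """Receive scored rollouts; the default ignores them (no-op)."""


class ShuffledCursorSampler(SampleStrategy):
    """Shuffle the rows once, walk them in order, reshuffle when exhausted."""

    def __init__(self, rows: Sequence[Any], seed: int = 0) -> None:
        if len(rows) == 0:
            raise ValueError("ShuffledCursorSampler needs at least one row")
        self._rows = list(rows)
        self._rng = random.Random(seed)  # private RNG, not shared across envs
        self._rng.shuffle(self._rows)
        self._cursor = 0

    def next_example(self) -> Any:
        if self._cursor == len(self._rows):
            # Epoch finished: reshuffle and start over.
            self._rng.shuffle(self._rows)
            self._cursor = 0
        row = self._rows[self._cursor]
        self._cursor += 1
        return row


# Toy usage: two full passes over four rows.
sampler = ShuffledCursorSampler(rows=["a", "b", "c", "d"], seed=7)
first_epoch = [sampler.next_example() for _ in range(4)]
second_epoch = [sampler.next_example() for _ in range(4)]
```

Each pass serves every row exactly once, and a sampler rebuilt with the same seed replays the same order. A curriculum or replay sampler would subclass `SampleStrategy` and override `observe()` to use the scored rollouts.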
What
Introduce the per-env sampling seam: each training env owns a `SampleStrategy` (what example to serve, plus an `observe()` feedback hook), and env selection is delegated to a swappable `EnvMixStrategy`. Defaults reproduce today's behavior; this is the foundation for curriculum / replay samplers.

Why
`TrainSource` previously hard-owned dataset iteration and env selection in one class, with no way to (a) plug a different per-env example-selection policy or (b) feed rollout outcomes back to the sampler. Splitting these into per-env + global strategies, and routing scored groups back via `observe()`, is what makes curriculum learning and (later) replay expressible without touching the dispatcher/perf path.

Changes
- `orchestrator/sampling.py` (new): `SampleStrategy` ABC + `ShuffledCursorSampler` default (per-env: shuffle rows once, walk a reshuffling cursor); `EnvMixStrategy` ABC + `WeightedRoundRobin` default (which env next).
- `TrainEnv` now owns its dataset via `build_sampler()` and holds a `.sampler`, reachable by both the source (pull) and the sink (observe).
- `TrainSource` shrinks to: build per-env samplers + `EnvMixStrategy`; `next_example` picks an env then pulls from that env's sampler. (Folds in the env-mix extraction, the "slice b" seam.)
- `TrainSink.process_group` calls `env.sampler.observe(survivors)` after advantages are assigned: the feedback wire (no-op for the default sampler).

Behavior
Equivalent in distribution to before: a weighted round-robin over per-env datasets that are each shuffled once and walked with a reshuffling cursor. The default `observe` is a no-op, so default runs train the same way. (RNG is now partitioned per-env + mix rather than one shared generator, so the exact example sequence for a given seed differs from before; the sampling distribution is unchanged, and nothing depends on the old ordering.)

Testing
- `tests/unit/orchestrator/test_sampling.py` (new, 8 tests): cursor cycles without repeats then reshuffles, determinism per seed, empty-rows guard, `observe` no-op, weighted-RR distribution + determinism.
- `ruff check` + `format --check` clean; existing `test_advantage.py` (17) + `test_configs.py` (106) still pass.
- `reverse_text` RL run with two envs (`rt-grpo`, `rt-lenpen`): both sampled every step through the new `EnvMixStrategy` + per-env samplers (varying ratios), trained cleanly (Error 0.0%, exit 0), with the `observe()` wire firing per group.

🤖 Generated with Claude Code
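For reviewers who want a concrete picture of the env-mix half and of what the weighted-RR distribution + determinism checks exercise, here is a minimal sketch. It assumes `WeightedRoundRobin` is a seeded draw with probability proportional to each env's weight; the PR does not spell out the interleaving, so the real class may differ. The `next_env` method name, the constructor signature, the 3:1 toy weights, and the `itertools.cycle` stand-ins for per-env samplers are all hypothetical.

```python
import itertools
import random
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, Iterator, Tuple


class EnvMixStrategy(ABC):
    """Global policy: which env serves the next example."""

    @abstractmethod
    def next_env(self) -> str:
        """Return the name of the env to pull from (method name is assumed)."""


class WeightedRoundRobin(EnvMixStrategy):
    """Assumed reading: seeded draw with probability proportional to weight."""

    def __init__(self, weights: Dict[str, float], seed: int = 0) -> None:
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be non-empty and positive")
        self._names = list(weights)
        self._weights = [weights[name] for name in self._names]
        self._rng = random.Random(seed)  # separate from per-env sampler RNGs

    def next_env(self) -> str:
        return self._rng.choices(self._names, weights=self._weights, k=1)[0]


def next_example(
    mix: EnvMixStrategy, samplers: Dict[str, Iterator[str]]
) -> Tuple[str, str]:
    """Pick an env first, then pull from that env's own sampler."""
    env_name = mix.next_env()
    return env_name, next(samplers[env_name])


# Stand-in samplers: plain cycling iterators over toy rows.
samplers = {
    "rt-grpo": itertools.cycle(["g0", "g1"]),
    "rt-lenpen": itertools.cycle(["l0", "l1", "l2"]),
}
mix = WeightedRoundRobin({"rt-grpo": 3.0, "rt-lenpen": 1.0}, seed=0)
draws = [next_example(mix, samplers) for _ in range(4000)]
counts = Counter(env_name for env_name, _ in draws)
```

With toy weights of 3:1, roughly three quarters of the draws should land on `rt-grpo`, and two mixes built with the same seed should produce the same env sequence. Keeping the mix RNG separate from the per-env RNGs mirrors the partitioning described under Behavior.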